Learning to use tidyverse for data exploration and modelling and bla bla
2022-05-07
The data come from the National Health and Nutrition Examination Survey and concern glycohemoglobin levels and diabetes mellitus (DM). They were obtained from https://hbiostat.org/data/.
Why this dataset?
| Variable | Description | Units | Levels |
|---|---|---|---|
| seqn | Unique patient ID | | |
| sex | Gender | | 0, 1 |
| age | Age | Years | 12 - 80 |
| re | Race/ethnicity | | 5 levels: White, Black, Mexican, Other Hispanic, Other |
| income | Family income level | $ | 14 levels from 0 - 100000 |
| tx | On Insulin or Diabetes meds | | 0, 1 |
| dx | Diagnosed with DM or pre-DM | | 0, 1 |
| wt | Weight | kg | 28 - 239.4 |
| ht | Height | cm | 123.3 - 202.7 |
| bmi | Body-mass index | kg/m^2 | 13.18 - 84.87 |
| leg | Upper leg length | cm | 20.4 - 50.6 |
| arml | Upper arm length | cm | 24.8 - 47 |
| armc | Arm circumference | cm | 16.8 - 61 |
| waist | Waist circumference | cm | 52 - 179 |
| tri | Triceps skinfold thickness | mm | 2.6 - 41.1 |
| sub | Subscapular skinfold thickness | mm | 3.8 - 40.4 |
| gh | Glycohemoglobin | % | 4 - 16.4 |
| albumin | Albumin | g/dL | 2.5 - 5.3 |
| bun | Blood urea nitrogen | mg/dL | 1 - 90 |
| SCr | Serum Creatinine | mg/dL | 0.14 - 15.66 |
The dx variable does not differentiate between type I and type II diabetes.
| Variable | Description | Units | Levels |
|---|---|---|---|
| income | Family income level | $ | 14 levels from 0 - 100000 |
Here we replaced missing income values with the mean of all non-NA values of income.
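A minimal sketch of mean imputation with dplyr, to make the step concrete. The `toy` table and its values are made up for illustration; only the column name `income` comes from the variable table, and the original analysis may have been written differently.

```r
library(dplyr)

# Toy stand-in for the survey table; the values are made up for illustration.
toy <- tibble(
  seqn   = 1:5,
  income = c(20000, NA, 40000, 60000, NA)
)

# Replace each missing income with the mean of the observed (non-NA) incomes.
imputed <- toy %>%
  mutate(income = if_else(is.na(income), mean(income, na.rm = TRUE), income))
```

Because income is recorded in 14 levels, the imputed mean will usually fall between levels; whether that matters depends on how income is used downstream.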
| Variable | Description | Units | Levels |
|---|---|---|---|
| leg | Upper leg length | cm | 20.4 - 50.6 |
| arml | Upper arm length | cm | 24.8 - 47 |
| armc | Arm circumference | cm | 16.8 - 61 |
| waist | Waist circumference | cm | 52 - 179 |
| tri | Triceps skinfold thickness | mm | 2.6 - 41.1 |
| sub | Subscapular skinfold thickness | mm | 3.8 - 40.4 |
Here we imputed missing values with KNN (K = 5), implemented in tidyverse. We did not optimize K.
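The sketch below shows one way KNN imputation can be written by hand. It is an illustration, not the code used for these slides: the helper `knn_impute()`, the toy data, the choice of `wt` and `ht` as predictors, and the use of Euclidean distance on standardised predictors are all assumptions. It also assumes the predictor columns themselves are complete.

```r
library(dplyr)

# Toy stand-in for the survey table; the values are made up for illustration.
toy <- tibble(
  wt    = c(60, 62, 80, 82, 100, 61),
  ht    = c(160, 162, 175, 178, 185, 161),
  waist = c(70, 72, 90, NA, 110, 71)
)

# Hypothetical helper: fill NAs in `target` with the mean of `target`
# over the k nearest rows, measured on the standardised `predictors`.
knn_impute <- function(data, target, predictors, k = 5) {
  x <- scale(as.matrix(data[predictors]))  # standardise so no predictor dominates
  y <- data[[target]]
  observed <- which(!is.na(y))
  for (i in which(is.na(y))) {
    # Euclidean distance from row i to every row with an observed target
    d <- sqrt(rowSums(sweep(x[observed, , drop = FALSE], 2, x[i, ])^2))
    nearest <- observed[order(d)[seq_len(min(k, length(observed)))]]
    y[i] <- mean(data[[target]][nearest])
  }
  data[[target]] <- y
  data
}

imputed <- knn_impute(toy, target = "waist", predictors = c("wt", "ht"), k = 5)
```

With only five observed rows in the toy data, K = 5 averages over all of them; on the full survey the neighbourhood would be a small fraction of the rows.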
Biochemical variables have more outliers
| Variable | Description | Units | Levels |
|---|---|---|---|
| SCr | Serum Creatinine | mg/dL | 0.14 - 15.66 |
The normal range is 0.6 - 1.2 mg/dL; values of 5 mg/dL or more indicate severe kidney impairment. We removed all values above 5 mg/dL (17 values in total). Source: https://www.medicinenet.com/creatinine_blood_test/article.htm
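The slides do not say whether "removed" means dropping the whole row or only blanking the value, so the sketch below shows both readings. The toy data are made up; only the column name `SCr` and the cutoff of 5 come from the text.

```r
library(dplyr)

# Toy stand-in for the survey table; the values are made up for illustration.
toy <- tibble(
  seqn = 1:6,
  SCr  = c(0.8, 1.1, 5.0, 6.3, NA, 15.66)
)

# Reading A: drop the whole row when SCr exceeds 5 (rows with missing SCr are kept).
rows_dropped <- toy %>%
  filter(is.na(SCr) | SCr <= 5)

# Reading B: keep every row but set the out-of-range value to NA.
values_blanked <- toy %>%
  mutate(SCr = if_else(SCr > 5, NA_real_, SCr))
```

Reading B keeps the other measurements of those patients available for analyses that do not involve creatinine.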
Positive correlations occur primarily between body-size-related variables.
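A correlation matrix of this kind can be computed as sketched below. The toy values are made up, and the choice of Pearson correlation with pairwise-complete observations is an assumption; the slides do not state which method was used.

```r
library(dplyr)

# Toy stand-in for the survey table; the values are made up for illustration.
toy <- tibble(
  wt      = c(60, 72, 85, 95, 110),
  waist   = c(70, 80, 92, 101, 115),
  albumin = c(4.4, 4.1, 4.5, 4.0, 4.3)
)

# Pearson correlations between all numeric columns,
# using every pair of rows where both values are present.
cor_mat <- toy %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")
```

In the toy data, `wt` and `waist` rise together and so correlate strongly, which mirrors the body-size pattern described above.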
Diagnosis status across BMI class
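The slides do not define the BMI classes, so the sketch below assumes the common WHO-style cutoffs (18.5, 25 and 30 kg/m^2) and the hypothetical column name `bmi_class`. The toy data are made up; `bmi` and `dx` are the variables from the table.

```r
library(dplyr)

# Toy stand-in for the survey table; the values are made up for illustration.
toy <- tibble(
  bmi = c(17.0, 22.5, 27.3, 31.8, 36.0, 24.9),
  dx  = c(0, 0, 1, 1, 0, 0)
)

by_class <- toy %>%
  # Assumed cutoffs; right = FALSE makes each interval include its lower bound.
  mutate(bmi_class = cut(
    bmi,
    breaks = c(-Inf, 18.5, 25, 30, Inf),
    labels = c("Underweight", "Normal", "Overweight", "Obese"),
    right  = FALSE
  )) %>%
  group_by(bmi_class) %>%
  # Share of diagnosed individuals (dx == 1) within each class
  summarise(n = n(), prop_dx = mean(dx), .groups = "drop")
```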
Age as a contributing factor to diagnosis across BMI class
Treatment status of different ethnicity and age
Influence of income and ethnicity on treatment status
Annual income levels and ethnicity do not seem to influence treatment status.
Serum albumin levels in relation to diagnosis
Serum albumin is lower in diagnosed compared to non-diagnosed individuals.
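A group-wise summary like the one sketched below is one simple way to check such a difference. The toy values are made up and are not results from the survey; `dx` and `albumin` are the variables from the table.

```r
library(dplyr)

# Toy stand-in for the survey table; the values are made up for illustration.
toy <- tibble(
  dx      = c(0, 0, 0, 1, 1, 1),
  albumin = c(4.4, 4.5, NA, 4.0, 4.1, 4.2)
)

# Mean and median albumin per diagnosis group, ignoring missing values.
albumin_by_dx <- toy %>%
  group_by(dx) %>%
  summarise(
    n              = n(),
    mean_albumin   = mean(albumin, na.rm = TRUE),
    median_albumin = median(albumin, na.rm = TRUE),
    .groups = "drop"
  )
```

A summary table only describes the groups; a formal comparison would need a statistical test chosen for the distribution of the data.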
Investigation of patterns concerning diagnosis of diabetes mellitus
Variables dx, tx, leg, arml, wt and ht were excluded
Investigation of patterns in relation to BMI
Variables bmi, wt and ht were excluded
Identify relevant number of clusters
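The slides do not name the clustering method, so the sketch below assumes k-means and the elbow heuristic: fit the model for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The simulated data and the range of k are illustrative only.

```r
library(dplyr)

set.seed(1)

# Simulated stand-in with two well-separated groups; not survey data.
toy <- tibble(
  age = c(rnorm(20, mean = 30, sd = 3), rnorm(20, mean = 65, sd = 3)),
  gh  = c(rnorm(20, mean = 5.2, sd = 0.2), rnorm(20, mean = 7.5, sd = 0.4))
)

# Standardise so both variables contribute equally to the distances.
scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..6 (assumed range).
elbow <- tibble(k = 1:6) %>%
  mutate(tot_withinss = vapply(
    k,
    function(centers) kmeans(scaled, centers = centers, nstart = 10)$tot.withinss,
    numeric(1)
  ))
```

Plotting `tot_withinss` against `k` gives the usual elbow curve; a weak or absent elbow suggests the data do not separate into clear clusters.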
Clusters between age and all other variables
Diagnosis of DM correlates with age, blood glucose, BMI, …
Income and race do not appear to predict DM diagnosis or treatment status.
Blood glucose overrules other variables in predicting DM diagnosis.
Patients cannot be clustered based on these variables alone.
Further research: it appears that older people with diabetes tend to be treated more often than younger people with diabetes.